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Abstract 

We study a reinforcement learning for temporal coding with neu- 
ral network consisting of stochastic spiking neurons. In neural net- 
works, information can be coded by characteristics of the timing of 
each neuronal firing, including the order of firing or the relative phase 
differences of firing. We derive the learning rule for this network and 
show that the network consisting of Hodgkin-Huxley neurons with the 
dynamical synaptic kinetics can learn the appropriate timing of each 
neuronal firing. We also investigate the system size dependence of 
learning efficiency. 



1 Introduction 

Many studies have assumed that neurons transmit information by their firing 
rate. The McCulloch-Pits unit is a typical model and networks of these 
units have been investigated. On the other hand, recent experiments suggest 
that the timing of neuronal firing may also contribute to the information 
representation function in the brain and the synaptic modification (Gray, 
Konig, Engel, & Singer, 1989; Bi & Poo, 1999; Varela, Lachaux, Rodriguez, 
& Martinerie, 2001; Reyes, 2003). For example, it seems that local and global 
synchronization play a significant role in integration of information which is 
distributed across the brain. Another example shows that the order of timing 
of neuronal firings can encode the information of stimuli on fingertips, and 
this encoding by sequence can transmit information faster than coding from 
the firing rate directly (Johansson & Birznieks, 2004). 

To capture the dynamical aspects of neural networks, networks consisting 
of various model neurons other than the McCulloch-Pitts unit have been 
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investigated, because McCulloch-Pitts units cannot describe the temporal 
behavior of neurons over short time scales. In this context, an associative 
memory for neural networks of oscillator neurons or spiking neurons has been 
studied (Aoyagi, 1995; Yoshioka & Shiino, 1998; Hoppensteadt & Izhikevich, 
1999; Kanamaru & Okabe, 2000; Hasegawa, 2001; Lee & Farhat, 2001, 2002). 
In these systems, the relative phase differences, i.e., the timing of firings, are 
used to represent the memory. 

There are few studies of learning in pulse neuron models such as those 
consisting of Hodgkin-Huxley (HH) neurons because of difficulty in deriving 
the learning rule. Although several studies have been made of learning in 
networks that consist of integrate-and-fire (IF) model neurons (Seung, 2003; 
Xie & Seung, 2004; Cios, Swiercz, & Jackson, 2004), most of these studies 
focus only on coding in terms of the firing rate. 

However, it would be useful to combine temporal coding and learning 
because it has been shown that temporal coding can deal with more infor- 
mation and process it faster than coding from just the firing rate (Thorpe, 
Delorme, & Rullen, 2001). As an example of this, Delorme et al.(2001) show 
that a neural network consisting of IF neurons can learn to identify human 
faces by using "Rank Order Coding", i.e., coding by the order of timing of 
each neuronal firing, where neurons are allowed to spike once only. 

In this paper, we study a reinforcement learning for temporal coding with 
neural network consisting of stochastic spiking neurons. After defining a 
network of coupled stochastic HH neurons and some quantities in Sec. II, we 
train the network to learn an XOR operation, where the output information 
is coded by the order of firing in Sec. III. In Sec. IV, we investigate how 
the result or performance of learning depends on the system size and the 
strength of noise, and conclusions follow. 

2 The model 

To illustrate an example of the learning process of spiking neurons, we con- 
sider a neural network consisting of HH neurons. Since HH neurons show 
excitability, they can code information by the timing of firing. The complete 
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dynamics for a network of coupled HH neurons may be expressed as 

Cm lF = 9Nam%{V Na - Vi) + g K n\(y K - V) + g L (V L - V) 

+ +£(*), (1) 

3 

di = 0£ m (Vi)(l - rrii) - P m (Vi)m i: (2) 
^ = a h (V i )(l-h i )-p h (V 1 )h l , (3) 

— - = «„(v;)(i - m) - PnWm, (4) 

at 

where Vi is the membrane potential of neuron i, C rn the membrane capac- 
itance, V r (r — Na,K,L) are the equilibrium potentials, g r (r = Na,K,L) the 
conductance, rrii,hi, Hi the voltage dependent activating or inactivating vari- 
ables, a x and (3 x {x = m, h, n) the functions of voltage Vi (Hodgkin & Huxley, 
1952), Wij is the synaptic weight from neuron j to i (wij ^ Wji in general), 
Ij(t) the synaptic current and £i(t) is the Gaussian white noise which obeys, 



m = o, (5) 



6(W)= QSijSit-t'), (6) 

where A is the average of A over time and Q the variance of noise. The 
synaptic current Ij(t) is given by 

F j (t)=r J (t)[Vsyn-V i \, (7) 

where V syn is the synaptic reversal potential and Tj(t) the fraction of bound 
receptors (Destexhe, Mainen, & Seijnowski, 1994) described by 

(IT ■ 

-^ = oO'(t)(l-r j )-Pr j , (8) 

{1 t° < t < t° + r 
tj S < tj + T, (g) 
otherwise, 

where a = 0.94 ms -1 , f3 = 0.18 ms" 1 , t° the time when the presynaptic 
neuron j fires (membrane potential over 27 mV) and r = 1.5 ms (Fernandez, 
Huerta, Corbacho, & Sigiienza, 2000) Fig. 1 shows the behavior of Vi(t) and 
Ti(t) of single neuron added the external current whose amplitude is 10 mA. 
We used the forth order Runge-Kutta method with the time step At = 0.01 
to solve Eqs. (1)~(4). 
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Figure 1: The behavior of Vi(t) and Tj(t) of single HH neuron added the 
external current. The neuron fires at t — 2 ms, then the value of rj(i) starts 
to increase. After a lapse of r = 1.5 ms, r^t) turns into decline. 
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To train the neural network, we use a reinforcement learning algorithm. 
Let us consider time sequences of states of neurons; a = (V(0), V(l), V(2), • • • , V(T)), 
where V(t) denotes the vector (Vi(t), ■ ■ ■ , V)v(t)). We assign a scalar value 
( "reward" ) to each time sequence a according to the signal from the network 
(Sutton & Barto, 1998). We give a high reward R to the desirable time se- 
quence a in each episode. Here we consider episodic learning. Since Eq. (1) 
includes the Gaussian white noise, we calculate the expected value of the 
reward (R), where (• • • ) signifies the average over all possible time sequences 
a. Then the goal of learning is to maximize (R) by adjusting to^. We use an 
ascending gradient strategy; 



where e is the learning coefficient. We can calculate the gradient of (R) with 
respect to Wij (Hayakawa, 2001; Fiete & Seung, 2006), 



For details of the derivation, see appendix. 

3 Learning procedure for temporal coding 

The present learning rule Eq. (10)~(12) can be effective for any information 
coding including the order coding. To show an example, we consider a neural 
network consisting of 2 input neurons, 15 output neurons and hidden neurons. 
We divide the set of output neurons into three disjoint subsets, Oi, Oi and 
03, each containing 5 output neurons. 

Here, we assume that information is coded by the temporal order of the 
"group activity" of subsets Oi, 2 and O3. As a learning goal, we chose 
an XOR operation in terms of order coding; the subsets Oi, C 2 and O3 
should fire in this order for input patterns [—1,-1] and [1,1], and in reverse 
order for input patterns [—1,1] and [1,-1]. We train the neural network to 
learn the suitable timing of firing in the subset O p for given input pattern 
k = [—1, —1], [—1, 1], [1, —1] and [1, 1], respectively, as described in Table 1. 
The right column of Table 1 shows the desirable order of group activity of 
subsets for each input pattern. The learning goal is the reduction of the RMS 
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Table 1: The order of collective firing as a learning goal. t\ = 2.0, t 2 = 4.0 
and t 3 = 6.0, respectively. 



Input 
pattern 


t k 


Desirable 
order 




2 


3 


[-1,-1] 


h 


t 2 


h 


01, 02, 03 


[1,-1] 


h 




h 


03, 02, 01 


[-1,1] 


h 




h 


03, 02, 01 


[1,1] 




h 


h 


01, 02, 03 



errors E k for each input pattern k given by 

° p=i ieOp 

where is the time when the neuron i fires and N the number of output 
neurons. 

We define an "episode" as the transient dynamics of the network which 
lasts from t — to t — T accompanied with external inputs. We evaluate 
the reward R depending on the resulting error E k through each episode for 
given input pattern as 

^l 1 E l <E r a*) 

where E k h is a threshold of error. We change the threshold E k h gradually as 
the learning proceeds. In the following simulations, we calculate the aver- 
age error E k vc over 100 episodes for each input pattern, and we update the 
threshold for the next 100 episodes by E k h = 0.99 x E k ve . E k h and E k vc are 
almost identical. It sometimes happens that E k ve increases through learning 
since the update of Wij to decrease E k will affect the value of E k ' where 
k ^ kl . Thus, E k h also sometimes increases, not decreases monotonically. 

During each episode, the input signals are fixed to the time-independent 
value, // = Io or — 1 , corresponding to the input patterns. (Here, I = 10 
niA.) The learning coefficient, the variance of noise and the period of one 
episode are e = 0.001, Q = 10000 and T = 10.0 ms, respectively. 

We assume that all the output and hidden neurons are excitatory, i.e., 
Wij for j $l X can take only positive value, where X is the set of input 
neurons, while for j G X can take both positive and negative value. We 
first set the synaptic weights for j G X using the uniformly distributed 
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random number in [—1.0, 1.0], while for j £ X between [^, ^], where A" is 
the number of neurons. Due to this normalization factor the magnitude 
of error E k does not depend on N. In this section, the number of hidden 
neurons was 5. 

Fig. 2 shows the behavior of the output neurons at initial state (upper) 
and that after 5 x 10 4 episodes for each input pattern (that is, 2 x 10 5 episodes 
in total) elapsed (lower). We find that neurons in Oi, O2 and O3 fire in this 
order for the input patterns [—1,-1] and [1,1], while in reverse order for 
the patterns [1, —1] and [—1, 1] after learning. In fact, Fig. 3 shows that 
the average error E^ vc decrease gradually through learning. These figures 
indicate that learning in terms of the order of firing by applying the present 
learning rule is successful. 

Fig. 4 shows Wij in a matrix form representing the connections with color 
after learning. As can be seen, there is few connection between 0\ and O3 
as a result of learning. This can be interpreted as the consequence that there 
is no direct correlation between 0\ and O3 in the both of the output time 
sequence. Few connections from X to O2 can be understood in the same way. 

4 System size dependence of learning efficiency 

Since realistic biological systems consist of many neurons, it is interesting 
and important how efficient this learning rule is for larger system size. In 
this section, we discuss the system size dependence of the performance of 
learning 

At first, we train the network for various number of hidden neurons Ah- 
In the following simulation, all parameters but Ah are given as the same as 
previous section. The sum of average errors E* ve are shown in Fig. 5(a). 
To investigate the property of learning, we assumed an exponential decay of 
the error ~ exp(— As) as the function of the number of episode s and showed 
A in inset of Fig. 5(a). As shown in Fig. 5(a), the speed of learning becomes 
slower when extra hidden neurons are added, and, decay rate A also has the 
same tendency. However, the error is significantly reduced as the episodes 
proceed. In this case, it turned out that the network with Ah = 5 showed 
best result in this simulation so that those added neurons are redundant and 
do not contribute learnability. 

Next, we examine the case the learning speed for various number of output 
neurons A^o, where each subsets O p consist of A^o/3 and the number of hidden 
neurons A" H = 5. Fig. 5(b) shows the error J2 k E* ve and the decay rate A 
(inset). In this first stage of the learning process (s < 10 4 ), decay rate is 
approximately given as A ~ Nq, where 5 ~ —2. However, if A is evaluated 
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Figure 2: The action potential of each output neuron at initial state (upper) 
and after 5 x 10 4 episodes for each input pattern (that is, 2 x 10 5 episodes 
in total) elapsed (lower). The index of each output neuron is shown on the 
vertical axis. The neurons with index < 5 are in Oi, those with index > 11 
are in G 3 , and the others are in (9 2 - Oi, @2 and O3 fire in this order for 
input patterns [-1,-1] an d [1,1], while in reverse order for [1,-1] and [-1,1]. See 
also Table 1. 
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Figure 3: E* ve for each input pattern as a function of episodes through learn- 
ing. Inset is the same data in semi-logarithmic plot. Typical value of E k 
before learning is approximately 3. 




Figure 4: Wij in a matrix form representing the connections with color after 
learning. Neuron j is presynaptic and % is postsynaptic. 
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Figure 5: X^-^ave as a function of episodes through learning for each iVn(a) 
and iVo(b). Inset: The decay constant A of ^2 k E^ ve for < s < 10 4 (filled 
circle) and 4 x 10 4 < s < 5 x 10 4 (open square) where s is the number of 
episode. The slope of dotted line in inset (b) is approximately —2. 
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Figure 6: ^ k E* ve as a function of episodes through learning for each Q. 
Inset: A of J2 k E te for < s < 10 4 (filled circle) and 4 x 10 4 < s < 5 x 10 4 
(open square). 



in the interval 4 x 10 4 < s < 5 x 10 4 , the Nq dependence of A seems to be 
more weaken, so that a long-term behavior of the learning process needs to 
be investigated to discuss the asymptotics. 

We would like to mention the effects of noise in the learning. The noise 
strength may affect the result or performance of learning because a random 
search process in the weight space is the essential part of this learning algo- 
rithm. We tested the dependence of the learning performance on the variance 
of noise Q in Eq. (6) , where Nq = 15, iV H = 5 and other parameters are 
given as previous section. Fig. 6 shows the learning error and its decay rate. 
Since the amount of update is in the order of \e/Q\, the magnitude 

of Q will directly affect the stability of learning. However, Fig. 6 indicates 
this learning rule is robust against the change of Q as long as the gradient 
ascending process assumed in Eq. (11) holds. 



5 Conclusions 

We have proposed a concrete method to encode temporal information for 
a network of stochastic spiking neurons utilizing a reinforcement learning 
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method, and we show that this learning rule is effective for order coding. 

Since the derivation of Eq. (11) does not depend on the form of the 
differential equation for the action potential like Eq. (1), the learning rule 
Eq. (11) is generally applicable to networks of other types of model neurons, 
for example neurons following the IF or FitzHugh-Nagumo (FHN) model 
with random fluctuation. In fact, we have checked the network consisting 
of IF or FHN neurons can also learn the same order coding task with the 
learning rule Eq. (11). (We omit these details in this paper.) 

The present learning rule can be applied not only to the order coding, but 
also to any information coding. In general, if one can design an appropriate 
reward R by defining the error of group activity from the reference signals, 
the learning process is given in a straightforward manner as described in 
Sec. III. For example, we confirmed that the present method also worked for 
the XOR task in terms of phase coding, where the relative phase difference 
codes information in the same way as oscillator neurons (Aoyagi, 1995). In 
the simulation, neural network consisted of 2 input neurons, a few hidden 
neurons and 10 output neurons which were divided into two disjoint subsets, 
0\ and O2, each containing 5 output elements. Learning goal was for group 
activity in 0\ and O2 to become in phase for input patterns [—1,-1] and 
[1, 1], and to be out of phase for [1, —1] and [—1, 1]. 

In general, convergence of the reinforcement learning is slow and the 
present model is not the exception. Several factors need to be investigated 
to improve the speed of convergence. Although we have assumed Gaussian 
white noise in this study, in real neuronal systems it is likely that noise may 
obey other statistical properties. For example, Fiete et al. (2006) discusses a 
learning rule using the arbitrary noise. Furthermore, learning speed depends 
to a great extent on the design of the cost function, i.e., the reward. If 
the design of the reward is not suitable, the learning process drops into 
local minima easily. In the present study, we empirically employed a sliding 
threshold method by changing the learning goal E k h in Eq. (14) started from 
easier one. Although this strategy can be applied to any learning tasks, the 
performance or result of learning is sensitive to define E k h . In this paper, we 
have defined E k h as A x E k vc where A = 0.99, because A = 0.9 is too severe 
to learn so that the speed of learning becomes slower and A = 1.0 causes 
overestimated reward so that the learning process drops into local minima. 
The point that E k h does not monotonically decrease is also important since 
the update of Wij to decrease E k will affect the value of E k ' where k ^ k! . 
Therefore, if we decrease E k h monotonically while E k ve increases because of 
this influence with update of w^, the task becomes harder to learn. As well 
as the conventional back propagation learning, how to decide the number of 
hidden neurons and assign the initial value of are also important problem. 
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In the XOR task with the order coding, convergence speed in learning process 
significantly affected by both the number of hidden and output neurons, 
larger size system seems to require more learning steps. We intend to examine 
this relationship in future research. 

We have assumed a fully connected network and assigned the random 
value to the initial state of Wij, which may not be appropriate for learning. 
Watts and Strogatz have proposed a "small-world" network structure (Watts 
& Strogatz, 1998). This network has been investigated in the context of a 
multi-layered feed-forward network, and some improvements in performance 
have been made (Simard, Nadeau, & Kroger, 2005). Also, the propagation 
velocity and coherent oscillation of IF or Hodgkin-Huxley neurons depend on 
the topology of network (which includes small-world network) (Fernandez et 
al., 2000; Roxin, Riecke, & Solla, 2004). These results may indicate that there 
are some suitable topologies of network for temporal coding as well. Since the 
present learning rule Eq. (12) includes only the local relation between neuron 
i and j, this learning rule may be applied to a network with an arbitrary 
topology. Finding a suitable network topology to facilitate learning is also a 
problem for future investigation. 
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Appendix : Estimation of gradient of (R) 

Let us introduce a stochastic processing element as a model neuron. The 
value of action potential of i-th neuron at discrete time t(= 0, 1,2, • • •) is 
represented by Vi(t). Vi(t) and other internal state rriiit) at successive time 
step t + 1 are then determined as 

N 

Vi(t + 1) = g(Vi(t), mi {t)) + mifWt)) + (15) 

3 

m i (t + l) = h(y i (t) 1 m l (t)), (16) 

where is the synaptic weight from neuron j to i, f(Vi(t)) the signal 
which i-th neuron emits according to Vi(t), N the total number of neurons, 
g(Vi(t),rrii(t)) and h(Vi(t), rrii{t)) are functions of the neuronal states, and 
is the Gaussian white noise which obeys Eq. (5) and Eq. (6). We consider 
that f(Vi(t)), g(Vi(t),mi(t)) and h(Vi(t) , m^i)) are arbitrary functions. 
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As a continuous time representation, we employ a Langevin equation for 
neuronal updates; 



N 



dm; 



dt 



h(Vi(t),mi(t)). 



(17) 



(18) 



Although we treat m,i(t) as a scalar variable in the following discussion, the 
same method can also be applied if m,i(t) is multivariate. 

We consider time sequences of states of neurons for t — 0, • • • , T; 
a = (V(0), V((l), V(2), • • • , V(T)), where V(t) denotes the vector 
(Vi (*),••• .Viv^.mi^),--- 

We assign a reward i? to each time sequence a and calculate the gradient 
of (R) with respect to (Hayakawa, 2001; Fiete & Seung, 2006). The 
probability density T a for each time sequence is defined as 



% = P(V(T)\V(T - 1)) • • -P(V(1)|V(0)), 



(19) 



where P(A\B) is the transition probability from state B to A. Since rrii(t) is 
deterministic and only Vi(t) includes the Gaussian noise described by Eqs. (5) 
and (6), the transition probability in Eq. (19) may be expressed as 



p(v(f+i)iv(f))=n 



exp 



(2ttQ)V2 

x8mi(t+l),h(Vi(t),mi(t)) 



exp 



2Q 



Us 



mi(t+l),h(yi(t),mi(t)), 



(20) 



where 



Vi(t) = 



N 



V t (t + l)-g(V t (t),m z (t)) -J^Wijfimjit)) 



(21) 



and 5 mjn is the Kronecker delta. The average of a quantity Y(cr) over all 
possible time sequences can be described by 



l^r^exp \-^J2J2m(t) 



(22) 
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where Z is a normalization factor. Therefore, the gradient of Y(a) may be 
obtained as 



UW l3 (j V t=o 



(23) 



In the continuous case (17), the sum over a in Eq. (22) becomes the path 
integral 



(Y) 



1 

Z 



Y(a) exp 



1 

2gy 



d(path), 



where 



Vi(t) = 



dVi 



N 



-^-gm),m z (t))-J2^m(t)) 



(24) 



(25) 



Using R(a) as Y(a), we obtain the derivative in Eq. (11) as 
d(R) 1 1 ' T 



(26) 



For the network consisting of HH neurons (1)~(4), we can derive Eq. (12) 
by instituting Eq. (7) to Eq. (26). 

Thus, this result indicates that the calculation of the gradient of (R) 
depends only on £«(£), f(Vj(t)) and R(a). The linear summation of the pre- 
synaptic neuronal activity and an additive independent noise as in Eq. (15) 
and Eq. (17) is essential for the present learning algorithm. This learning 
rule includes only the local relation between neurons % and j, while back- 
propagation (Rumelhart, Hinton, & Williams, 1986) or RTRL (Williams & 
Zipser, 1989) requires the information of other neurons in addition to i,j in 
order to calculate 5wij. 
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